Abstract. The traditional analysis of Internet chat room discussions places a resource burden on the intelligence 
community because of the time required to monitor thousands of continuous chat sessions. Chat rooms are used 
to discuss virtually any subject, including computer hacking and bomb making, creating a virtual sanctuary for 
criminals to collaborate. Given the upsurge of interest in homeland security issues, we have developed a text 
classification system that creates a concept-based profile that represents a summary of the topics discussed in a 
chat room or by an individual participant. We then discuss this basic chat profiling system and demonstrate the 
ability to selectively augment the standard concept database with new concepts of significance to an agent. 
Finally, we show how an investigator can, once alerted to a user or session of interest via the profile, retrieve 
details about the chat session through our chat archiving and search system. 


1 Introduction 


One rapidly growing type of communication on the Internet is online chat, in the form of Instant Messaging (IM) and 
Internet Relay Chat (IRC). Once primarily popular with teenagers, its use has exploded among adults, and it is expected 
to surpass electronic mail as the primary way in which people interact with each other electronically by the year 2005 
[7]. Online chat provides a mode of talking to others using a keyboard, as in e-mail, but it has the spontaneity of 
real-time human interaction, much like that of a telephone call. The popularity of IM has been sparked by the ability to 
stay connected with associates, friends, and family, and its entertainment value through meeting new people, 
swapping files, and participation in town-hall style chat rooms to discuss virtually any imaginable topic. 
Unfortunately, criminals have also migrated to the electronic chat havens, sneaking into homes to lure children [5], 
committing corporate and homeland espionage [4], and even discussing terrorist plots with other outlaws [3, 14]. 

Instant messaging is defined as a client-based peer-to-peer chat discussion occurring between small numbers of 
participants. Client-based chat software revolves around direct connections between users, and a central directory 
server is used to broadcast the availability of each client. Since chat traffic is not broadcast through a centralized 
server, no central authority exists to monitor or enforce good behavior. Popular IM products include MSN 
Messenger, Yahoo Messenger, and Trillian. In contrast to IM, server-based chat takes place on one or more 
dedicated servers, where all chat traffic is first transmitted to a server, which then relays it to the appropriate 
recipients. IRC is server-based, wherein participants log in with client software such as XChat or mIRC. One 
server is capable of handling hundreds of simultaneous users and chat rooms.¹ Participants join and create new 
rooms to chat with groups and also to privately whisper with one another. 

While most participants in chat rooms are engaged in legitimate information exchanges, chat rooms can also be 
used as a forum for discussions of dangerous activities. By detecting discussions associated with terrorism or illegal 
activities, crime prevention abilities may be enhanced. For example, a terrorist may be looking to recruit like- 
minded people to his cell, or child predators may be attempting to lure children. To automatically identify 
suspicious conversations and/or individuals, intelligence workers need an archive of online chat discussions and 
sophisticated tools to help them filter and process the large quantity of collected information. Human monitors 
could actively watch and decide when chat is of concern to law enforcement, but that would be prohibitively 
expensive and people are unlikely to participate in a chat system that is openly scrutinized by human agents. 

Online chat provides a new tool to support criminal collaborations. Intelligence agencies need tools to analyze 
and archive suspicious subjects’ online chat, whether it is obtained in public chat rooms or through covert analysis 
of all chat data passing through a particular server via packet sniffing [11]. Since chat sessions can span several 
days or even months, agents need a way of recording and tracking topics generated from chat discussions, analogous 
to FBI analysis of e-mail [16]. With the increased emphasis on security issues since September 11, 2001, agencies 
would benefit from identifying suspicious electronic conversations automatically. 

¹ A February 10, 2004 sampling of chat networks indicates 131,832 clients in 46,345 rooms on EFnet; 125,350 clients in 47,080 
rooms on Undernet; and 107,832 clients in 54,536 rooms on IRCnet. Most people do not remain connected 24/7, so the user 
base is several times larger. 

Our goal is to assist in crime detection and prevention by automatically creating concept-based profiles 
summarizing chat session data. We have created ChatTrack, http://www.ittc.ku.edu/chattrack, a system that can 
generate a profile on the topics being discussed in a particular chat room or by a particular individual. This 
summary allows analysts to focus their attention on people or discussions whose profiles intersect with domains of 
national security interest. For example, a profile showing active participation in topics of “bacteria cultures/anthrax” 
and “water reservoirs” could be significant to counterterrorism intelligence, whereas an individual pursuing the topic 
of “instruments/violins” could be disregarded (unless investigating a Stradivarius heist). Because the profiles 
created are summaries, individual privacy is not violated unless further investigation of the session is conducted, 
requiring further permissions. We discuss our method of profiling chat conversations and the search mechanism 
utilized for more in-depth analysis, and provide sample results from data collected on IRC. 


2 Literature Review 


Topic identification from text documents is a longstanding problem. However, since online chat is structured 
differently from written discourse, this poses new challenges for classification systems. Because chat sessions do 
not contain linear discussions of single topics, but rather partially threaded, interwoven topics that oscillate in short, 
incomplete utterances, topic detection is even more difficult for chat sessions. Studies have started to look at the 
current problems facing chat room topic detection and monitoring. 

A chat room’s topics vary based on its participants’ current interests, so it is impossible for a person to know 
what topics are being discussed in a chat room without first joining it to observe its contents. A conversation- 
finding agent called Butterfly aims to resolve this by visiting rooms to sample thirty seconds of data that is then used 
to create a simple term vector [15]. People searching for chat topics request recommendations by querying with 
words in which they are interested, also represented as a term vector. An “interest profile” is generated by summing 
the weights of the dot product of the two vectors; and when the sum is above a threshold, the chat room is 
recommended. One limitation is that Butterfly visits only rooms it can see; however, it is the “secret rooms” that are 
considered the most useful and interesting to the intelligence community. 
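
The recommendation step described for Butterfly can be illustrated with a minimal sketch. This is not Butterfly's actual code: the whitespace tokenizer, the use of raw term counts as weights, and the threshold value are all assumptions made for illustration, and the room names and sample text are invented.

```python
from collections import Counter

def term_vector(text):
    """Build a raw term-frequency vector from whitespace-separated text."""
    return Counter(text.lower().split())

def interest_score(query_vec, room_vec):
    """Sum the per-term products of two sparse vectors (their dot product)."""
    return sum(weight * room_vec[term] for term, weight in query_vec.items())

def recommend(query, room_samples, threshold=1.0):
    """Return the rooms whose sampled text scores at or above the threshold."""
    query_vec = term_vector(query)
    return [room for room, sample in room_samples.items()
            if interest_score(query_vec, term_vector(sample)) >= threshold]

# Hypothetical samples standing in for thirty seconds of chat per room.
samples = {
    "#gardening": "tomato seeds need warm soil and steady water",
    "#linux": "recompile the kernel after you patch the driver",
}
print(recommend("kernel driver", samples))
```

Only rooms that were actually sampled can be scored, which is the limitation noted above for rooms the agent cannot see.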

E-mail messages are typically threaded; i.e., a reply to a message is grouped with the original message. 
Conversely, chat participants do not take orderly turns. A response to one utterance may come several minutes 
after it is spoken, opening the potential for confusion if a large number of people are present. [17] looks at 
automatic techniques for discovering, threading, and retrieving material in streams of data. Topic detection and 
tracking aims to discover new events and to track known topics. Segmentation is used to find boundaries between 
topics, followed by a logistic regression to combine probabilities obtained from a language model built from 
concatenated training events. These topics are then clustered to distinguish new events from pre-existing ones. Since 
news events were used as test data, it is reasonable to suspect that chat utterances may not provide sufficient 
keywords. Additional studies using techniques borrowed from text classification may assist in threading chat 
utterances. 

Understanding the interests of, and interactions between, the participants of chat rooms can be used to create a 
profile of an individual’s behavior. PeopleGarden [18] uses a metaphor for creating individual visual data portraits and to represent 
the interactions between users. The data portrait combines attributes of user activity such as utterances spoken, the 
rate of responses, and relationships between users, such as groups of users avoiding outsiders. Attributes such as 
timing between utterances, responses, initial chat vs. reply, are included in the portrait, and change over time to 
represent a temporal view. A garden metaphor represents one group, and groups are visually compared to indicate 
how interactions differ, such as a dominating voice or a democratic group with equal participation. 

Topic classification requires a domain of expertise for each concept. IBM has developed WebFountain, which 
makes use of chat room archives, web pages, e-mail, and so on, which can then be queried for knowledge of a 
specific domain [2, 13]. This could provide intelligence agents a means for assembling bodies of knowledge on 
concepts of national security interest, such as a terrorist organization. [6] evaluated text classification techniques 
(Naive Bayes, k-nearest neighbor, and support vector machine) to find suitable methods of chat log classification to 
assist in automated crime detection. Training data was limited to four categories, limiting topic distinctions to law 
enforcement analysis. [1] looks at topic identification in chat lines and news sources using complexity pursuit. This 
algorithm separates interesting components from a time series of data, identifying some underlying hidden topics in 
streaming data. The authors report that their results could complement queries with other topic segmentation and 
tracking methods on text streams. 

Our work complements the above systems. If PeopleGarden were combined with a concept-based profiler such 
as ChatTrack, the result would be a very useful security tool: closed groups discussing topics of national 
security could be flagged for analysis by intelligence agents. We focus on detecting the topics discussed by an 
individual or within a session of chat data. Combining information from multiple sources, as is done in WebFountain, would 
increase the accuracy of the profiles, and tracking the profiles over time, as is done by topic tracking 
systems, would also be a valuable enhancement. 


3 Topic Detection for Security Analysis 


We have developed ChatTrack, a collective set of applications and tools for the analysis and archival of electronic 
chat room discussions. As shown in Figure 1, ChatTrack archives the contents of chat room discussions in XML 
format. Subsets of this data can be classified upon request to create profiles for particular sessions or particular 
users. The classifier is pre-trained on over 1,500 concepts, and it can also be extended to include concepts of 
interest to a particular analyst. Thus, the profiles created can be easily adapted to a variety of domains. The data is 
also indexed for quick search by speaker, listener, keyword, session, and/or date and time. 

Fig. 1. The ChatProfile and ChatRetrieve systems are major components of ChatTrack and provide Web-based user interfaces 


3.1 Archiving 


There are two ways to log activity on server-based systems: the first uses an 
autonomous bot agent, and the second uses a modified chat server. To collect samples for testing, we obtained chat 
session data from public chat rooms on various IRC networks and from our own modified server. 

ChatTrackBot is designed to silently record activity from chat rooms of which it is a member. It is controlled by 
the owner, who sends it commands indicating the room names to join and part. The bot appears as a normal user 
but, unlike some chatbots, it does not attempt to fool others into thinking it is human by responding to participants’ 
queries, and appears as an end-user IRC client program when queried by the IRC client-to-client protocol. For 
testing purposes, we augmented an IRC server to record all chat activities (including public and whispered 
utterances). While bots are limited to recording only the activity visible to them, a modified server is capable of 
logging all system activity. 

Both ChatTrackBot and our modified IRC server store the archived data in XML format. We designed our own 
XML schema to identify essential chat actions such as sending/receiving a message, joining/parting a chat room, 
logging on/off a server, nickname changes, and so on. The XML chat data is parsed to extract the chat utterances, 
and then each utterance is parsed and stored into three files, separately storing the speaker and receiver names and 
the utterance. The receiver name consists of either the public chat room in which the utterance was spoken or the 
username to which the utterance is being privately sent. These files are placed in a directory hierarchy archive that 
encodes the date and room name identifier, and the filenames encode the utterance transmission time. 
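
The archive layout described above can be sketched as follows. The exact directory names, file extensions, and time format used by ChatTrack are not given here, so the `<date>/<receiver>/<time>.<part>` layout and the function name `store_utterance` are hypothetical choices made only to illustrate the idea.

```python
import tempfile
from datetime import datetime
from pathlib import Path

def store_utterance(root, when, speaker, receiver, text):
    """Write one utterance as three files under <root>/<date>/<receiver>/."""
    # The directory encodes the date and the room (or private recipient) name.
    directory = Path(root) / when.strftime("%Y-%m-%d") / receiver.lstrip("#")
    directory.mkdir(parents=True, exist_ok=True)
    stem = when.strftime("%H%M%S")  # the file name encodes the transmission time
    parts = {"speaker": speaker, "receiver": receiver, "utterance": text}
    paths = []
    for suffix, content in parts.items():
        path = directory / f"{stem}.{suffix}"
        path.write_text(content, encoding="utf-8")
        paths.append(path)
    return paths

# Demonstration in a throwaway directory; the names and text are invented.
with tempfile.TemporaryDirectory() as root:
    written = store_utterance(root, datetime(2004, 2, 3, 14, 5, 9),
                              "participant42", "#american-politics", "hello all")
    print([p.relative_to(root).as_posix() for p in written])
```

A one-second time stamp would let two utterances sent in the same second overwrite each other, so a real archive would need a finer time resolution or a sequence number.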

The ChatTrack project page, http://www.ittc.ku.edu/chattrack, makes many of these products available. The 
XML schema, modified IRC server, and ChatTrackBot are all available there. In addition, we developed and 
distribute ChatLog, a library written in C that can be used by chat developers to extend their own client and/or 
server-based chat software to create XML log files. 


3.2 ChatProfile 


As chat data is archived, an intelligence agent needs an automated way of summarizing and filtering the chat 
utterances. We addressed this by designing ChatProfile, a system that uses text classification to create a profile of 
chat data, in essence creating a summary of the chat. Profiles indicating topics of concern can then prompt the 
analyst to explore the associated chat session in more detail using ChatRetrieve, as discussed in Section 3.3. 

As the basis of ChatProfile, we used a vector space classifier developed as part of the KeyConcept project [8]. 
The classifier must first be trained, i.e., taught what concepts are to be searched for in the chat sessions. During 
training, the classifier is provided with a set of pre-defined concepts and example text that has been classified into 
the concepts, taken from the Open Directory Project (ODP).² Then, for each category, the classifier creates a 
vector of representative keywords and their weights based upon the tf*idf (term frequency * inverse document 
frequency) formula developed for vector space search engines [12]. We trained our classifier on 1,565 concepts 
from the top three levels of the ODP hierarchy. These were selected because they had a minimum of 28 training 
documents, found through empirical studies to provide adequate training data for accurate classification [10]. 
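
The training step can be sketched in a few lines. The KeyConcept classifier's exact weighting and normalization are not specified here, so this example assumes the plain textbook form, tf * log(N / df), with all training text for a concept merged into one document; the concept names and training text are invented.

```python
import math
from collections import Counter

def train_concepts(training_docs):
    """Map each concept to a tf*idf-weighted keyword vector.

    training_docs: {concept name: [training document text, ...]}
    """
    # Assumption: all training text for a concept is treated as one document.
    tf = {concept: Counter(" ".join(docs).lower().split())
          for concept, docs in training_docs.items()}
    n = len(tf)
    # Document frequency: the number of concepts in which each term occurs.
    df = Counter(term for counts in tf.values() for term in counts)
    return {
        concept: {term: freq * math.log(n / df[term]) for term, freq in counts.items()}
        for concept, counts in tf.items()
    }

docs = {
    "Science/Biology": ["bacteria cultures grow in the lab"],
    "Arts/Music": ["violins in the orchestra"],
}
vectors = train_concepts(docs)
print(vectors["Arts/Music"])
```

Terms that occur in every concept, such as "the" in this toy data, receive a weight of zero, which is the intended effect of the idf factor.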

To create a profile from chat data, an analyst selects the criteria of interest using any combination of session 
name, speaker/receiver names, and a date/time range, as depicted by the web-based interface in Figure 2. A user 
profile focuses on one chat participant by filtering based on a speaker name only, whereas a session profile filters by 
chat room name only. The chat utterances fulfilling the analyst’s criteria are then collected from the archive and 
prepared for classification by removing high frequency (stop) words and stemming by the Porter stemmer. The 
classification of several hundred chat utterances generally completes in under a minute. 
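
The preparation step can be illustrated as below. The stop-word list is a tiny invented sample, and `crude_stem` is a deliberately simple suffix stripper used as a stand-in so that the example stays self-contained; it is not the Porter algorithm, which applies a much longer sequence of rules.

```python
import re

# A tiny illustrative stop-word list; a real list holds hundreds of words.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "we"}

def crude_stem(word):
    """Strip a few common suffixes; a placeholder, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(utterance):
    """Lowercase, tokenize, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z0-9]+", utterance.lower())
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]

print(preprocess("We are testing the water reservoirs"))
# prints ['test', 'water', 'reservoir']
```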

The classifier then creates a vector of keywords extracted from the chat data and calculates the similarity between 
this vector and the vectors for each trained concept using the cosine similarity measure [12]. The concepts are then 
sorted by their similarity values in decreasing order, and the top-matching concepts and their similarity values are 
returned as the result of classification. Concepts are visually represented in a hierarchical fashion using a Web 
interface, with asterisks shown next to each concept to indicate the relative concept weights at each level of the 
hierarchy. 
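
A minimal sketch of the ranking step follows. Vectors are represented as dictionaries mapping keywords to weights; the concept vectors, the chat vector, and the `top_n` parameter are invented for illustration and do not come from the trained ChatProfile database.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse keyword->weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def classify(chat_vec, concept_vecs, top_n=3):
    """Return the top-matching (concept, similarity) pairs, best first."""
    scored = [(concept, cosine(chat_vec, vec)) for concept, vec in concept_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

concepts = {
    "News/Politics": {"election": 2.0, "vote": 1.0},
    "Arts/Music": {"violin": 3.0},
}
chat = {"election": 1.0, "senate": 1.0}
print(classify(chat, concepts))
```

Because cosine similarity normalizes by vector length, a long chat session and a short one that use the same vocabulary in the same proportions receive the same score.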


² See http://dmoz.org 

Fig. 2. The ChatProfile filtering interface allows an agent to select criteria of interest for inclusion in a user or session profile 


Envision a scenario in which an agent wishes to generate a session profile on the #american-politics chat room. 
(Test data was obtained from a public IRC server and includes 7,612 utterances from 225 chat participants.) As 
depicted in Figure 2, the #american-politics session is selected from the drop-down session list, and the agent selects 
filtering for activity occurring on February 3, 2004. Listener and Speaker ID fields are left blank so as to include 
all participants, and a profile is then generated. In the results shown in Figure 3, the three asterisks next to 
News (compared to one or zero asterisks for all other top-level concepts) indicate that the majority of the chat falls 
under the “News” concept. When the News concept is expanded, we find that the most highly weighted sub-concepts 
are Politics and Alternative Media. Further expansion reveals that the profile for the chat room is heavily 
weighted in “US Election 2004” and “Conservative and Libertarian.” 

Fig. 3. Results of a session profile generated from one day of chat from #american-politics 


The agent may now be interested in learning more about a particular participant’s interests. This can be 
determined by creating a user profile. A user profile may be generated on all utterances spoken throughout an entire 
chat system to find a participant’s global interests, or merely their interests revealed in a single chat session. By 
creating an individual’s profile in a variety of rooms, and comparing them, an agent may be able to identify users 
who are spoofing their identities, i.e., behaving one way in one room but entirely differently in other rooms or at 
other times. By comparing an individual’s profile to the session profile, an agent may be able to identify a user that 
is conducting “off-topic” conversations and may be holding a private meeting for illicit reasons. Figure 4 reveals the 
profile generated on a single chat participant for the same session used in Figure 3. This particular participant spoke 
46 utterances in the session, shown in the right hand pane, and his/her profile interests seem focused on societal 
issues dealing with abortion. All actual usernames in the utterances have been masked for privacy. 
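
The comparison of a user profile against a session profile, as described above, could be automated along the following lines. This is a sketch of one possible approach rather than a feature of ChatProfile: profiles are assumed to be dictionaries mapping concept names to weights, and the similarity threshold of 0.3 is an arbitrary illustrative value.

```python
import math

def profile_similarity(p, q):
    """Cosine similarity between two concept->weight profiles."""
    dot = sum(weight * q.get(concept, 0.0) for concept, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def off_topic_users(session_profile, user_profiles, threshold=0.3):
    """List users whose profile has little overlap with the session profile."""
    return sorted(user for user, profile in user_profiles.items()
                  if profile_similarity(profile, session_profile) < threshold)

# Invented profiles for illustration only.
session = {"News/Politics": 0.8, "Society/Issues": 0.3}
users = {
    "participant7": {"News/Politics": 0.6, "Society/Issues": 0.5},
    "participant42": {"Science/Chemistry": 0.9},
}
print(off_topic_users(session, users))
```

The same similarity function could compare one user's profiles across several rooms, where a low score would suggest markedly different behavior from room to room.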

Fig. 4. Results of a user profile on one chat participant, restricted to #american-politics 

Fig. 5. Session profile of #hackphreak 

Fig. 6. Session profile of #hackphreak augmented with concept database supporting “Computers/Hacking” as a concept 


The interests of intelligence agencies evolve as world events take place and new homeland security threats 
appear. To deal with this reality, ChatProfile has been designed to allow new concepts to be added to the concept 
database for profile recognition. To demonstrate this ability, we added training documents for the concepts of 
terrorism and hacking. Training documents were selected from various websites that seemed authoritative on their 
respective topics. An agent would typically make use of a concept database that has been tuned for category 
recognition for the subjects of interest, omitting uninteresting concepts. A session profile created on #hackphreak 
(14,367 utterances from 270 unique participants) indicates a profile in the “Arts” category of music and writing 
(Figure 5). Using the enhanced concept database for terrorism/hacking increases the strength of the “Computers” 
top-level category, and identifies the secondary level of hacking (Figure 6). The terrorism concept is not shown as 
an example, but was successfully identified in experiments. 


3.3 ChatRetrieve 


Once a profile that warrants further analysis has been identified, an intelligence agent needs the ability to recall the 
chat session linked to the profile in question. ChatRetrieve was designed to provide sophisticated retrieval of chat 
data, in essence providing the “details” for chat session analysis. 

In order to keep the chat session archive up to date, indexing takes place concurrently as users chat. We use an 
incremental indexer that runs at a predefined interval. If no new chat data exists, the indexer sleeps, 
waiting for new chat utterances to index. When new XML data appears, the logs are parsed as described in Section 
3.1, and the output is added to the inverted indices. Traditional indexers create an index from scratch every time 
new data appears, requiring more time as the data archive size increases. Incremental indexing solves this issue by 
adding the chat utterances received since the previous indexing cycle to a pre-existing index. Thus, chat messages 
are available within seconds or minutes of the time they occur. Incremental indexing carries a retrieval-speed penalty, but a 
full re-indexing process could be performed during scheduled maintenance to improve retrieval times. 
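
The idea of incremental indexing can be shown with a small in-memory sketch. ChatRetrieve's on-disk index format is not described here, so the class below, its name, and its use of a Python list as the archive are assumptions for illustration only.

```python
from collections import defaultdict

class IncrementalIndex:
    """Inverted index that only processes utterances added since the last update."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of utterances containing it
        self.indexed = 0                  # number of archived utterances already indexed

    def update(self, archive):
        """Index the utterances appended to `archive` since the previous call."""
        new = archive[self.indexed:]
        for offset, text in enumerate(new):
            for term in text.lower().split():
                self.postings[term].add(self.indexed + offset)
        self.indexed += len(new)
        return len(new)  # zero means there was nothing to do, so the caller can sleep

index = IncrementalIndex()
archive = ["hello world"]
print(index.update(archive))   # indexes the first utterance
archive.append("world news")
print(index.update(archive))   # indexes only the new utterance
print(sorted(index.postings["world"]))
```

A scheduler would call `update` at the predefined interval; the work done in each cycle depends on the amount of new chat rather than on the size of the whole archive.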

Agents can utilize a retrieval application to perform queries on the indexed chat data through a web-based interface so 
that the chat data can remain on a secure server. Keyword retrieval is based on tf*idf. Queries can be based on 
speaker name, keywords, a date range, or any combination of these criteria, as depicted by Figure 7. The user may 
control several results display parameters such as the number of results per page and how many messages to show 
around the matching utterance. The results of querying for keywords “terrorist” and “homeland” belonging to 
speaker “participant42” (name changed for privacy) are shown in Figure 8. On this results page, clicking on a room 
name link performs a chat session replay that includes utterances spoken by all participants of that session. 
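
A query of this kind can be sketched as follows. The scoring uses a plain tf * log(N / df) sum over the query keywords, which is an assumed textbook form rather than ChatRetrieve's exact formula; the log is represented as a list of (speaker, text) pairs, the `context` parameter plays the role of the "messages to show around the matching utterance" setting, and the sample utterances are invented.

```python
import math
from collections import Counter

def search(log, keywords, speaker=None, context=1):
    """Rank utterances by tf*idf over the keywords, optionally filtered by speaker.

    log: list of (speaker, text) pairs in chronological order.
    Returns a list of (score, index, surrounding utterances), best first.
    """
    docs = [Counter(text.lower().split()) for _, text in log]
    n = len(docs)
    results = []
    for i, (who, _) in enumerate(log):
        if speaker is not None and who != speaker:
            continue
        score = 0.0
        for keyword in keywords:
            term = keyword.lower()
            df = sum(1 for doc in docs if term in doc)
            if df:
                score += docs[i][term] * math.log(n / df)
        if score > 0:
            window = log[max(0, i - context): i + context + 1]
            results.append((score, i, window))
    return sorted(results, key=lambda result: result[0], reverse=True)

log = [
    ("participant7", "anyone watching the debate tonight"),
    ("participant42", "the homeland security bill passed"),
    ("participant7", "homeland again"),
    ("participant42", "nothing to add"),
]
for score, position, window in search(log, ["homeland"], speaker="participant42"):
    print(position, round(score, 3), window)
```

The context window includes utterances from all speakers, so a match by one participant is shown together with the surrounding conversation.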

Fig. 7. Screenshot of the ChatRetrieve interface 

Fig. 8. Results of searching for “terrorist” and “homeland” only spoken by “participant42” 


One unique feature of the ChatRetrieve system is that we track all the chat room participants even if they do not 
contribute to the discussions. The agent may choose to display the listeners who heard particular messages. We are 
currently working on the ability to search by listener and/or profile by listener to allow analysis of users who choose 
to eavesdrop on certain discussions but not actually exchange messages. 


4 Discussion and Future Work 


We have developed a system that automatically generates a conceptual profile from chat data using classification 
techniques. Profiles may be created for chat sessions and/or individual users, filtered on criteria such as listener and 
sender usernames, date/time, and chat room name. Further exploration of chat sessions of interest is facilitated by 
providing search parameters on keyword, speaker, listener and/or date and time. This combination of automatic 
topic identification and chat exploration provides intelligence agents the tools they need to be vigilant in the fight 
against crime by reducing efforts required for the manual review of chat room conversations. 

Our future work is associated with automating temporal analysis of chat room topics. Sudden changes of 
interests in profiles could indicate odd behavior and could be brought to the attention of an investigator for analysis. 
For example, a sudden change in a chat room’s profile, such as an upswing in discussions about public water 
facilities, should be flagged for review. Profiling individual users also has the benefit of indicating their interests 
and could be analyzed for sudden shifts of topic as well. 
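
One simple way such temporal analysis might work is to compare consecutive profiles and flag any concept whose weight rises sharply. The sketch below is speculative, since this work is proposed rather than implemented: the daily granularity, the `jump` threshold of 0.3, and the concept names and weights are all assumptions.

```python
def sudden_shifts(daily_profiles, jump=0.3):
    """Flag concepts whose weight rises by at least `jump` from one day to the next.

    daily_profiles: list of concept->weight dictionaries, one per day, in order.
    Returns a list of (day index, concept, change) tuples.
    """
    flagged = []
    for day in range(1, len(daily_profiles)):
        previous, current = daily_profiles[day - 1], daily_profiles[day]
        for concept, weight in current.items():
            change = weight - previous.get(concept, 0.0)
            if change >= jump:
                flagged.append((day, concept, round(change, 3)))
    return flagged

# Invented weights: an upswing in a water-related concept on the third day.
profiles = [
    {"News/Politics": 0.70, "Science/Water": 0.05},
    {"News/Politics": 0.65, "Science/Water": 0.10},
    {"News/Politics": 0.30, "Science/Water": 0.60},
]
print(sudden_shifts(profiles))
```

The same function could be applied to a single user's profiles over time to detect a sudden shift in that individual's interests.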

Due to the informal nature and “non-English” text of chat messages [9] (e.g., “brb — be right back”, “lol — 
laughing out loud”), preprocessing may help to extract meaningful words from the seemingly meaningless chatter. 
We hope to provide investigators the ability to retrieve and replay chat sessions from the chat archive based on 
topics (e.g., “bomb making”) rather than specific words (e.g., “nitroglycerin” or “pipe bomb”). Chat topic 
classification has applications outside of the intelligence community as well. For example, parents who wish to 
restrict and monitor the types of messages their children send and receive could proactively block dangerous messages 
based on topic; corporations could review employee chat for unusual topics; and chat network 
providers could identify rogue chatters or predators stalking children and forward chat data to the proper authorities. 
Questionable chat rooms could also be identified by administrators searching for violations of server policies. 
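
The preprocessing of chat shorthand mentioned at the start of this paragraph could begin with a simple lookup table. The sketch below uses only the two abbreviations given as examples in the text; a usable table would be far larger, and the function name is hypothetical.

```python
import re

# The two entries come from the examples in the text; a real table would be larger.
SHORTHAND = {"brb": "be right back", "lol": "laughing out loud"}

def expand_shorthand(utterance, table=SHORTHAND):
    """Replace known chat abbreviations with their full phrases."""
    def replace(match):
        word = match.group(0)
        return table.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", replace, utterance)

print(expand_shorthand("LOL ok brb"))
# prints: laughing out loud ok be right back
```

Expansion of this kind would run before stop-word removal and stemming, so that the expanded words can contribute to classification.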

We could also consider factors like metrical habits, rhetorical devices, and polysyllabic words to identify authors 
and include these in user profiles, since chat nicknames can change over time. For example, priority could be given 
to investigation of chat logs collected from the utterances of a supposed child, yet demonstrating the vocabulary or 
language model of an adult. Users who exhibit “adult” language models in some sessions and child-like language 
models in others may be child predators, and could be identified in this manner. 